Presidential Speeches

Topical and Lexical Similarity

Harman Singh

John Cabot University

Background

  • Much criticism of Donald Trump has centered on his being unfit for the Office of President of the United States, or his “unpresidentiality”.

  • Some of this criticism emerges from Donald Trump’s rhetorical style, which has also been deemed “unpresidential”.

  • But what does “Presidentiality” mean? Are there common traits, character qualities, rhetorical styles, or other elements that are common to US Presidents?

Research Question

  • Have US Presidents throughout history given speeches and official addresses similar to one another’s?

  • What have been the most common topics of their speeches, and how have these topics changed over time?

Data Sources

  • The Miller Center at the University of Virginia’s ‘Presidential Speeches’ collection makes speeches from George Washington to the present available in text format

  • Most speeches are official addresses, remarks, or statements

  • Freely available to the public for download in JSON format

  • While not exhaustive, the collection is extensive, containing over 1,000 speeches

Methodology

Selecting the Data

  • Selection reflects temporal shifts in American society and realignments in American domestic and foreign policy

  • Only Presidents from the 20th century onward, beginning with Theodore Roosevelt (1901)

  • Only speeches delivered while in office; campaign and other non-presidential speeches excluded for consistency

  • Captures the historical trend of topics still relevant today while excluding archaic topics (slavery, railroads, etc.)

  • Minimal text cleaning to preserve semantic and contextual coherence for the models

BERTopic

  • BERTopic groups similar speeches by their language patterns, automatically identifying key themes with advanced language models

  • Considers the context of words in relation to one another to find and build common topics

  • The five most prevalent topics show that the subjects US Presidents speak about are temporally contingent

Cosine Similarity using Word Embeddings

  • Measures how similar speeches are by comparing them in vector space

  • Uses word embeddings to capture semantic meaning, not just keywords

  • Helps identify subtle language patterns and thematic connections (see the sketch below)
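As a minimal sketch of the idea (toy vectors standing in for averaged speech embeddings, not actual data; the full pipeline is in the appendix):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical embedding vectors; a score near 1 means the "speeches" point in nearly the same direction
pres_a = np.array([[0.2, 0.7, 0.1, 0.4]])
pres_b = np.array([[0.3, 0.6, 0.0, 0.5]])
print(cosine_similarity(pres_a, pres_b))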

Cosine Similarity using TF-IDF

  • Compares speeches based on word frequency adjusted by overall rarity

  • Effective for spotting shared vocabulary across texts

  • Less suited to capturing semantic meaning compared with the embedding strategy (see the sketch below)

  • Gerald Ford: 0.56; Donald Trump: 0.62
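As a minimal sketch of the approach (hypothetical sentences, not actual speech text; the full computation is in the appendix):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy "speeches"; TF-IDF downweights words that appear in every document
texts = ["peace and prosperity for the nation",
         "war and peace among the nations",
         "tax relief for the national economy"]
X = TfidfVectorizer(stop_words="english").fit_transform(texts)
print(cosine_similarity(X).round(2))   # 3 x 3 pairwise similarity matrix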

Conclusion

  • First, the prevalence of topics in Presidential speeches is time dependent: it fluctuates with changes in the global and domestic political spheres.

  • Second, in both content and vocabulary, Presidents from Coolidge through Clinton resemble one another, after which there is a break and a new set of similarities begins.

  • Third, while Donald Trump’s rhetorical choices have been criticized as a great break from previous Presidents, this break is verifiable only in terms of topic/thematic consistency, not in terms of vocabulary.

Appendix

This is a technical appendix documenting the operations performed to create this memo.

Part 1: Loading the Data

# load the file
import json

with open('speeches.json', 'r') as file:
  speeches = json.load(file)
#convert to a pandas DataFrame
import pandas as pd
df = pd.json_normalize(speeches)
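
As a quick sanity check (the fields referenced below, such as 'president', 'date', and 'transcript', are the ones used throughout this appendix):

# Inspect the number of speeches and the available fields
print(df.shape)
print(df.columns.tolist())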

Part 2: Cleaning and Organizing the Text

# We only want to keep Presidents who start from 1900 on and drop all others
keep_presidents = [
    "Theodore Roosevelt", "William Taft", "Woodrow Wilson",
    "Warren G. Harding", "Calvin Coolidge", "Herbert Hoover",
    "Franklin D. Roosevelt", "Harry S. Truman", "Dwight D. Eisenhower",
    "John F. Kennedy", "Lyndon B. Johnson", "Richard M. Nixon",
    "Gerald Ford", "Jimmy Carter", "Ronald Reagan",
    "George H. W. Bush", "Bill Clinton", "George W. Bush",
    "Barack Obama", "Donald Trump", "Joe Biden"
]

df_new = df[df["president"].isin(keep_presidents)].copy()

df_new = df_new.drop(['doc_name', 'title'], axis=1)

# Chronological List
df_new.sort_values("date", inplace=True)
df_new.reset_index(drop=True, inplace=True)
# Define the term(s) in which each President came to office and when they left
# (a list of (start, end) tuples per President, so that Donald Trump's two
# non-consecutive terms can both be represented without a duplicate dict key)
president_terms = {
    "Theodore Roosevelt": [("1901-09-14", "1909-03-04")],
    "William Taft": [("1909-03-04", "1913-03-04")],
    "Woodrow Wilson": [("1913-03-04", "1921-03-04")],
    "Warren G. Harding": [("1921-03-04", "1923-08-02")],
    "Calvin Coolidge": [("1923-08-02", "1929-03-04")],
    "Herbert Hoover": [("1929-03-04", "1933-03-04")],
    "Franklin D. Roosevelt": [("1933-03-04", "1945-04-12")],
    "Harry S. Truman": [("1945-04-12", "1953-01-20")],
    "Dwight D. Eisenhower": [("1953-01-20", "1961-01-20")],
    "John F. Kennedy": [("1961-01-20", "1963-11-22")],
    "Lyndon B. Johnson": [("1963-11-22", "1969-01-20")],
    "Richard M. Nixon": [("1969-01-20", "1974-08-09")],
    "Gerald Ford": [("1974-08-09", "1977-01-20")],
    "Jimmy Carter": [("1977-01-20", "1981-01-20")],
    "Ronald Reagan": [("1981-01-20", "1989-01-20")],
    "George H. W. Bush": [("1989-01-20", "1993-01-20")],
    "Bill Clinton": [("1993-01-20", "2001-01-20")],
    "George W. Bush": [("2001-01-20", "2009-01-20")],
    "Barack Obama": [("2009-01-20", "2017-01-20")],
    "Donald Trump": [("2017-01-20", "2021-01-20"), ("2025-01-20", "2025-04-27")],
    "Joe Biden": [("2021-01-20", "2025-01-20")],
}
df_new['date'] = pd.to_datetime(df_new['date'], format='ISO8601', utc=True, errors='coerce')
df_new = df_new.dropna(subset=['date']) # drop any na's
df_new['date'] = df_new['date'].dt.date
# Update dictionary so each term's start and end are in a clean date format
for pres, terms in president_terms.items():
    president_terms[pres] = [
        (pd.to_datetime(start).date(),
         pd.to_datetime(end).date() if end else pd.Timestamp.today().date())
        for start, end in terms
    ]
# Drop speeches given by any president when they were not actively in office, so only presidential speeches remain in our data
def was_president_at_time(row):
    pres = row['president']
    date = row['date']
    # True if the speech date falls within any of this president's terms
    for start, end in president_terms.get(pres, []):
        if start <= date <= end:
            return True
    return False


df_proper = df_new[df_new.apply(was_president_at_time, axis=1)].reset_index(drop=True)
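
As an optional sanity check (a minimal addition, not part of the original pipeline), the number of in-office speeches retained per president can be inspected:

# Count how many speeches survive the term filter for each president
print(df_proper['president'].value_counts())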

Part 3: BERTopic Analysis

# Import all relevant models
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# Makes text of speech into one constant string, no returns or breaks included
def clean_text(text):
    text = text.replace('\n', ' ')
    return text.strip()
df_proper['cleaned_text'] = df_proper['transcript'].apply(clean_text)
# Create chunks of 300 words of the speeches
def chunk_text(text, max_words=300):
    words = text.split()
    return [' '.join(words[i:i+max_words]) for i in range(0, len(words), max_words)]

df_proper['chunks'] = df_proper['cleaned_text'].apply(chunk_text)
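# Quick illustration (toy input): chunk_text("one two three four five", max_words=2)
# returns ['one two', 'three four', 'five']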

# Create separate dataframe of all chunks flattened to analyze
docs_chunked = [chunk for chunks in df_proper['chunks'] for chunk in chunks]
# Add a speech id index
df_proper = df_proper.reset_index(drop=True)  
df_proper['speech_id'] = df_proper.index + 1
# Initialize BERTopic Model and apply it to the dataframe of just chunks 
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import contextlib
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
umap_model = UMAP(random_state=42)

#embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
#vectorizer_model = CountVectorizer(stop_words="english")

topic_model = BERTopic(
  #embedding_model=embedding_model,
  #vectorizer_model=vectorizer_model,   # additional models that could be used included here
  calculate_probabilities=True,
  verbose=False,
  umap_model = umap_model,
  #top_n_words=7,
  #nr_topics="auto",
)

topics, probs = topic_model.fit_transform(docs_chunked)
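# topics: one topic id per chunk (-1 marks BERTopic's outlier topic);
# probs: a per-chunk probability distribution over the non-outlier topics
# (returned because calculate_probabilities=True)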
# Get information on the topics identified by BERTopic, counts for each topic, associated keywords
topic_info = topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

topic_info_simple = topic_info[["Topic", "Count", "Representation"]].copy()

# De-duplicate each topic's keywords and join them into a single string for readability
topic_info_simple['Representation'] = topic_info_simple['Representation'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))

frequency_table = topic_info_simple.copy()
# Create a new Dataframe of just chunked text with requisite column names, fashioned just like our old DF
chunked_data = []

for idx, row in df_proper.iterrows():
    chunks = chunk_text(row['cleaned_text'], max_words=300)
    for chunk in chunks:
        chunked_data.append({
            "original_speech_id": row['speech_id'],
            "president": row['president'],
            "date": row['date'],
            "transcript": chunk
        })


df_chunked = pd.DataFrame(chunked_data)
# Create topic column that attaches associated topics to each chunk
df_chunked['topic'] = topics
# Reassign chunks labeled Topic -1 (BERTopic's outlier/catch-all topic, dominated by filler words such as 'the', 'of', 'a') to the next most probable topic
## Ensures that every speech chunk carries its best real topic instead of the catch-all label
excluded_topics = [-1]

def reassign_topic(topic, prob_row):
    if topic in excluded_topics:
        sorted_indices = np.argsort(prob_row)[::-1]
        for idx in sorted_indices:
            if idx not in excluded_topics:
                return idx
        return topic  
    else:
        return topic

df_chunked["topic"] = [
    reassign_topic(t, p) for t, p in zip(df_chunked["topic"], probs)
]
# Map each topic id to its keyword label for better visualization
# and attach the year a speech was given to each chunk
topic_labels = {
    row["Topic"]: row["Representation"] 
    for _, row in topic_info_simple.iterrows()
}

df_chunked["topic_label"] = df_chunked["topic"].map(topic_labels)
df_chunked['year'] = pd.to_datetime(df_chunked['date']).dt.year

Data Visualization

# Create a dataframe restricted to the 5 most prevalent topics
top_5_topics = df_chunked['topic'].value_counts().head(5).index
df_top_5 = df_chunked[df_chunked['topic'].isin(top_5_topics)]
# Count the number of speech chunks for each topic by year
df_count_by_year = df_top_5.groupby(['year', 'topic']).size().reset_index(name='count')

df_total_by_year = df_chunked.groupby('year').size().reset_index(name='total')
df_count_by_year = pd.merge(df_count_by_year, df_total_by_year, on='year')

# Compute each topic's share of all speech chunks in a given year
## Accounts for years with fewer or more speeches
df_count_by_year['proportion'] = df_count_by_year['count'] / df_count_by_year['total']
df_count_by_year['topic_labels'] = df_count_by_year["topic"].map(topic_labels)
# Create summary labels for each of the following topics for easier visualization on a graph
manual_labels = {
  0: "Vietnam War",
  1: "Health Care",
  7: "Banks, Credit, Gold",
  2: "Peace, Nations, War",
  3: "Rights, Blacks, White",
}

df_count_by_year['manual_labels'] = df_count_by_year["topic"].map(manual_labels)
# Create an area graph for the 5 topics together
library(ggplot2)
count_by_year <- reticulate::py$df_count_by_year

ggplot(count_by_year, aes(x = year, y = proportion, fill = manual_labels)) +
  geom_area() +
  theme_minimal() +
  labs(title = "Top 5 Topics in Presidential Speeches Over Time",
       x = "Year",
       y = "Proportion of Speeches",
       fill = "Topic") +
  scale_fill_viridis_d() +
  theme(plot.title = element_text(size = 12,face='bold'),
        legend.position = "bottom",
        legend.text = element_text(size = 6))
# Create line graph for each topic/graph by itself
ggplot(count_by_year, aes(x = year, y = proportion)) +
  geom_line(aes(color = manual_labels)) +
  facet_wrap(~ manual_labels, scales = "free_y") +  
  theme_minimal() +
  labs(title = "Top 5 Topics in Presidential Speeches Over Time",
       x = "Year",
       y = "Proportion of Speeches",
       color = "Topic") +
  scale_color_viridis_d() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 10,face='bold')
  )

Part 4: Cosine Similarity

Word Embedding

# Get embedded topics using the BERTopic model 

topics, probs = topic_model.transform(df_chunked["transcript"].tolist())
embeddings = topic_model._extract_embeddings(df_chunked["transcript"].tolist(), method="document")
  
df_chunked["embedding"] = list(embeddings)
# Group presidential speeches and embeddings by president, get average embedding to get just one for each president 
president_embedding = df_chunked.groupby("president")["embedding"].apply(
    lambda emb_list: np.mean(np.vstack(emb_list), axis=0)
    )
# Apply cosine similarity test to the data from above
from sklearn.metrics.pairwise import cosine_similarity

X = np.vstack(president_embedding.values)
cosine_similarity_matrix = cosine_similarity(X)
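# cosine_similarity_matrix is an n_presidents x n_presidents array;
# entry [i, j] is the similarity between the averaged embeddings of presidents i and j
# (the diagonal is 1.0, since each president is identical to themselves)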
# Create a DataFrame of the cosine similarity scores, labeled by president for pairwise comparison
presidents = president_embedding.index.tolist()

similarity_df = pd.DataFrame(cosine_similarity_matrix, index=presidents, columns=presidents)

Word Embedding Visualization

# Plot a heatmap with the viridis color scale for easier visualization
library(reshape2)
library(viridis)


similarity_matrix <- reticulate::py$similarity_df
similarity_matrix <- as.matrix(similarity_matrix)
melted_matrix <- melt(similarity_matrix, varnames = c("president_1", "president_2"))

ggplot(melted_matrix, aes(x = president_1, y = president_2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.3) +  
  scale_fill_viridis(
    option = "viridis",  # Try "magma", "plasma", or "inferno" for other easily visualizable variants
    direction = -1,     
    limits = c(min(melted_matrix$value), max(melted_matrix$value)) 
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 10),
    legend.position = "right",
    plot.title = element_text(size = 10,face='bold')  

  ) +
  labs(
    x = "President",
    y = "President",
    title = "Word Embedding - Cosine Similarity of Presidential Speeches",
    fill = "Similarity"
  ) +
  coord_fixed()  

TF-IDF Cosine Similarity

# Import vectorizer and list of stopwords to clean our text
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Define function to remove stopwords from a text
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)
# Join each president's speeches into a single string
presidents_aggregated = df_proper.groupby('president')['cleaned_text'].apply(" ".join).reset_index()

# Cleans the text by removing stopwords
presidents_aggregated['cleaned_text'] = presidents_aggregated['cleaned_text'].apply(remove_stopwords)

# Vectorize the text with TF-IDF weights and save the vectorized matrix separately
vectorizer = TfidfVectorizer()
president_dfm = vectorizer.fit_transform(presidents_aggregated['cleaned_text'])
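# president_dfm is a sparse (n_presidents x vocabulary_size) matrix of TF-IDF weights,
# one row per president's aggregated, stopword-free text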
# Apply cosine similarity test and create a Dataframe 
pres_cosine = cosine_similarity(president_dfm,president_dfm)

similarity_df_2 = pd.DataFrame(
    pres_cosine,
    index=presidents_aggregated['president'].tolist(),
    columns=presidents_aggregated['president'].tolist()
)

TF-IDF Visualization

# Creates another heatmap in the same style as the previous one, using viridis for ease of visualization
similarity_matrix_2 <- reticulate::py$similarity_df_2
similarity_matrix_2 <- as.matrix(similarity_matrix_2)
melted_matrix_2 <- melt(similarity_matrix_2, varnames = c("president_1", "president_2"))


ggplot(melted_matrix_2, aes(x = president_1, y = president_2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.3) +  
  scale_fill_viridis(
    option = "viridis",  # Try "magma", "plasma", or "inferno" for other easily visualizable variants
    direction = -1,     
    limits = c(min(melted_matrix_2$value), max(melted_matrix_2$value)) 
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 6),
    axis.text.y = element_text(size = 10),
    legend.position = "right",
    plot.title = element_text(size = 10, face='bold')  

  ) +
  labs(
    x = "President",
    y = "President",
    title = "TF-IDF Cosine Similarity of Presidential Speeches",
    fill = "Similarity"
  ) +
  coord_fixed()